<a href=”{{ site.baseurl }}“>
Pericytes are a type of cell that enwraps small blood vessels
throughout all vascularized tissues of the body. Pericytes are thought
to be a key effector cell that modulates blood vessel behavior and can
profoundly influence clinical outcomes in chronic disease. However,
pericytes are a difficult cell type to study because they lack exclusive
cellular markers, making identification and isolation difficult.
Currently, pericytes are identified by a panel of markers, their
cellular shape, and their perivascular position relative to capillaries
(Fig 1) (1-3). Recently, a study by the Yuan Lab [1] at
Harvard Medical School sought to identify novel pericyte-specific
markers using single cell RNA-seq from the Tabula Muris dataset. The
Tabula Muris is a compendium of single cell transcriptome data from the
model organism Mus musculus, containing nearly 100,000 cells from 20
organs and tissues [2].
The authors used UMAP to visualize gene expression from
clustering with a KNN graph based on Euclidean distance generated with
linear dimension reduction via PCA. Pericyte associated clusters were
identified with high co-expression of two pericyte markers, CSPG4 and
PDGFRB. A subclass of cells in this group with exceptionally high
co-expression were identified as “stringent” pericytes. Differential
gene expression analysis of stringent pericytes against all other cell
types yielded a panel of novel candidate pericyte markers. While this
study identified interesting new potential pericyte markers for
scientific research, there are potential queries with the methodology
used for the processing pipeline that may yield a better understanding
of the results:
These queries motivate the following research
questions:
1. Will the same overall trends in clustering still appear if the
datasets are not integrated across tissue types?
2. Does the clustering approach used in the study effectively delineate
between the manually annotated cell types from Tabula Muris?
3. Do the pericyte designated clusters align with the manual annotations
that already exist in the database?
Primary Paper and Dataset References
[1] S.-H. Baek, E. Maiorino, H. Kim, K. Glass, B. A. Raby, K. Yuan,
Single Cell Transcriptomic Analysis Reveals Organ Specific Pericyte
Markers and Identities. Frontiers in Cardiovascular Medicine. 9 (2022)
(available at https://www.frontiersin.org/articles/10.3389/fcvm.2022.876591).
[2] The Tabula Muris Consortium., Overall coordination., Logistical
coordination. et al. Single-cell transcriptomics of 20 mouse organs
creates a Tabula Muris. Nature 562, 367–372 (2018). https://doi.org/10.1038/s41586-018-0590-4.
Primary Packages: Seurat, AnnotationHub, ggplot,
tidyverse, pheatmap.
Supporting Packages: cowplot, RColorBrewer,
BiocManager, stringr
>> Code for loading packages and initializing parallel processing.
# Note: this code assumes that the working directory is set to the source file
# location
# Install required Base Packages
base_packages <- c("Seurat", "ggplot2", "tidyverse", "cowplot", "RColorBrewer",
"BiocManager", "stringr", "pheatmap", "future", "rmdformats",
"memuse")
install.packages(setdiff(base_packages, rownames(installed.packages())))
# Install required Bioconductor Packages
biocm_packages <- c("AnnotationHub", "ensembldb")
bioc_installs <- setdiff(biocm_packages, rownames(installed.packages()))
if (length(bioc_installs)) {BiocManager::install(bioc_installs) }
# Load base packages
lapply(base_packages, library, character.only = TRUE)
# Load Bioconductor packages packages
lapply(biocm_packages, library, character.only = TRUE)
# Setup multicore processing
plan(multisession, workers = max(c(parallelly::availableCores()-2,1)))
# Set max ram to 90% of available
options(future.globals.maxSize = as.numeric(memuse::Sys.meminfo()[[2]]) *.9)For the purposes of clarity and reproducibility, below are summaries of the processing pipeline used in this project, an explanation of the gene annotation database, supporting files necessary to carry out the analysis, and the source code.
Below are function parameters that were used in the reference
publication for processing. 1. NormalizeData(normalization.method =
“LogNormalize”, scale.factor = 10000)
2. FindVariableFeatures(selection.method = “vst”, nfeatures =
2000)
3. ScaleData(model.use = “linear”,)
4. RunPCA()
5. Dimensionality: Bladder: 30, Kidney: 45, Lung and Heart: 45 5.
FindNeighbors(dims = 1:dimensionality)
6. FindClusters(resolution = 0.5, algorithm=1)
7. RunUMAP()
Seurat single cell datasets typically record gene expression by
their ensemble ID which are unique static codes to identify features in
the genome. However, they are not human readable, so we must convert
ensemble IDs to the gene name with an annotated gene database.
Processing pipelines typically use bioMart or AnnotationHub as a
reference database. AnnotationHub was used because it maintains a 1:1
mapping from ensemble ID’s to gene names, while bioMart does not because
it can have a one to many relationship between ensemble IDs and entrez
IDs.
>> Code for annotating genes with gene name.
# Check for gene annotation file
if (!file.exists(paste0(getwd(),"/data/annotations_Mm.csv"))) {
# Connect to Annotation Hub as the source for gene annotations
ah <- AnnotationHub()
# Access the Ensemble database for organism
ahDb <- query(ah, pattern = c("Mus musculus", "EnsDb"), ignore.case = TRUE)
# Acquire the latest annotation files
id <- ahDb %>% mcols() %>% rownames() %>% tail(n = 1)
# Download the appropriate Ensembldb database
edb <- ah[[id]]
# Extract gene-level information from database
annotations <- genes(edb, return.type = "data.frame")
# Select annotations of interest
annotations <- annotations %>%
dplyr::select(gene_id, gene_name, seq_name, gene_biotype, description)
# Export table to csv
data.table::fwrite(annotations, file = paste0(getwd(),"/data/annotations_Mm.csv"))
}
# Load ensembl <-> gene annotation conversion table
gene_annotations <- data.table::fread(file = paste0(getwd(),"/data/annotations_Mm.csv"),
header = TRUE)Seurat datafiles provided by the Tabula Muris were used in this analysis for bladder, heart, kindey, and lung tissues. Data sets were renamed to [tissue].rds and saved in the /data/ subfolder of the project repository.
Direct Link to Seurat datafiles found on Figshare.
bladder.rds
348 MB, MD5 checksum: 12349521850bed75aaba9a4abb5daf48
heart.rds
81 MB, MD5 checksum: f4740230d092fe5c072a3eacb5f533e6
kidney.rds
378 MB, MD5 checksum: afae3d5895401126c51090f5d6f9299a
lung.rds
752 MB,MD5 checksum: f6a12d88b19e7ecd94256c0f0f0ed284
Note: Tabula Muris already performed based QC with the data, such as exlcuding cells with a high fraction of mitochondrial DNA, too few or to many features in a cell, and features in too few cells.
>> Code for processing Seurat pipeline.
# Get dataset filenames (named by tissue type, see Section 2.3.1
data_filenames <- str_replace(list.files(paste0(getwd(),'/data/'),pattern='rds',
full.names = F), '.rds','')[c(4,2,3,1)]
# Specify dimensions for linear dimensional reduction (used by authors in paper)
dim_thresh <- c(30,35,45,35)
# Specify clusters of interest for each tissue
clusters_oi <- list(13,c(11,10),19,24)
# Load all Seurat objects into list
seurat_objs <- list()
# Process and cluster each tissue data set individually (do not integrate)
if (!file.exists(paste0(getwd(),"/results/seurat_objs.rdata"))) {
for(x in seq_along(data_filenames)){
# Load seurat object
seurat_obj <- readRDS(paste0(getwd(),'/data/', data_filenames[x], '.rds'))
# Create cleaned/pretty cell type labels
clean_labels <- function(x)
str_replace_all(str_replace_all(x, paste0(data_filenames[x]," "),""),
c("cell "="", "^nan$" = "unknown"))
seurat_obj@meta.data$cell_label <-
factor(str_to_title(clean_labels(seurat_obj@meta.data$free_annotation)))
# Normalize expression by fraction of total, * 10,000, and log-transform
seurat_objs[[x]] <- NormalizeData(object = seurat_obj)
# Features selection: id features with highest single cell variation (used for PCA)
seurat_objs[[x]] <- FindVariableFeatures(object = seurat_objs[[x]])
# Eliminate uninteresting sources of variation (batch effects, technical noise, cell cycle)
# By regressing out these signals with linear models and outputting the residuals
seurat_objs[[x]] <- ScaleData(object = seurat_objs[[x]])
# Perform PCA on residuals to compress dataset and reduce noise from inconsequential genes
# (Using highly variable features identified earlier)
seurat_objs[[x]] <- RunPCA(object = seurat_objs[[x]])
# Automated alternative for determining number of PCs to include
# JackStrawPlot(object = seurat_objs[[x]], PCs = 1:50)
# Construct KNN graph refined with Jaccard similarity between local neighborhoods
seurat_objs[[x]] <- FindNeighbors(object = seurat_objs[[x]], dims = 1:dim_thresh[x])
# Cluster cells with Louvain algorithm (default)
seurat_objs[[x]] <- FindClusters(object = seurat_objs[[x]], resolution = 0.5)
# Run non-linear dimension reduction to reveal underlying manifold of the data
seurat_objs[[x]] <- RunUMAP(object = seurat_objs[[x]], dims = 1:dim_thresh[x])
}
# Export seurat objects to avoid recomputation
names(seurat_objs) <- data_filenames
save(seurat_objs, file = paste0(getwd(),"/results/seurat_objs.rdata"))
} else {
# If seurat datafiles already computed, load from disk
load(paste0(getwd(),"/results/seurat_objs.rdata"))
}>> Code for data visualizations.
# Visualizations of dimensions, clusters, and gene expression
# Objects below are all ggplot objects
gg_elbows <- list() # elbow plots
gg_umaps <- list() # unsupervised umap plots
gg_dots <- list() # avg expression dot plots
gg_umap_dot <- list() # merged umap and dot plots (above vars)
gg_annot_umap = list() # annotated umap plot with tabula muris cell types
gg_annot_table= list() # composiiton of umap clusters by tabula muris cell types
# Visualizations
for(x in seq_along(data_filenames)){
# Convert cell type labels
remove_list = c("Epcam","Pecam", data_filenames[x], "cell")
#
clean_labels = function(k) str_replace_all(str_replace_all(k, paste(remove_list,collapse="|"),""),
c("\\s+"=" ", "^nan$" = "unknown"))
seurat_objs[[x]]@meta.data$cell_label <-
factor(str_to_title(str_trim(clean_labels(seurat_objs[[x]]@meta.data$free_annotation))))
# Determine number of PCs to use
gg_elbows[[x]] <- ElbowPlot(seurat_objs[[x]], ndims = 50, reduction = "pca") +
ggtitle(tools::toTitleCase(data_filenames[x]), ) +
geom_vline(xintercept = dim_thresh[x], color = "red") +
theme(plot.title = element_text(hjust = 0.5))
save_plot(paste0(getwd(),'/results/elbow_', data_filenames[x], '.png'),
gg_elbows[[x]], base_width = 3, base_height = 3)
# Visualize clusters
gg_temp <- DimPlot(seurat_objs[[x]], reduction = 'umap', label = FALSE, repel = TRUE) +
ggtitle(sprintf("%s: %s Total Cells", tools::toTitleCase(data_filenames[x]),
format(dim(seurat_objs[[x]])[2],big.mark=",", trim=TRUE))) +
theme_void() + theme(legend.position="none",plot.title = element_text(size=12,hjust = 0.5))
gg_umaps[[x]] <- LabelClusters(gg_temp, id = "ident", fontface = "bold", size = 6)
# Create a dot plot of expression of PC markers
gg_dots[[x]] <- DotPlot(seurat_objs[[x]], features = c("ENSMUSG00000032911","ENSMUSG00000024620")) +
scale_x_discrete(labels=c('CSPG4', 'PDFRB'),position = "top") + xlab("") + ylab("Cluster") +
scale_y_discrete(position = "right") + #scale_color_gradient(title) +
scale_size_area(breaks=c(25,50,75), limits = c(0, 75)) +
scale_color_gradient(breaks = rev(c(0,1,2)), limits = c(0, 2.5), low = "#D1CFD4", high = "#1E0BFE")+
guides(size=guide_legend(title="% Exp."), colour = guide_colourbar(title="Mean Exp.")) +
theme(text=element_text(size=12),plot.margin = unit(c(0, 0, 0, 0), "cm"),
axis.text.x = element_text(size =12,angle = 45, hjust =0), axis.text.y = element_text(size =12),
axis.ticks=element_blank(),panel.grid.major=element_blank(),
legend.position = "none",
panel.grid.minor = element_blank())
# Create grid of cluster and expression table, then add legend at bottom
gg_cluster_expr <- plot_grid(gg_umaps[[x]], gg_dots[[x]], ncol = 2, rel_widths = c(3, 1))
# Add a second row with a legend
legend <- get_legend(gg_dots[[x]]+theme(legend.position = "bottom",legend.justification="center"))
gg_umap_dot[[x]] <- plot_grid(gg_cluster_expr, legend, nrow = 2, rel_heights = c(1, .1))
save_plot(paste0(getwd(),'/results/umap_organ_', data_filenames[x], '.png'),
gg_umap_dot[[x]], base_width = 8, base_height = 6)
# Visualize Cluster with cell type annotated by atlas
gg_annot_umap[[x]] <- DimPlot(seurat_objs[[x]], reduction = 'umap', group.by = "cell_label",
label = FALSE, repel = TRUE) +
ggtitle(sprintf("%s: %s Total Cells", tools::toTitleCase(data_filenames[x]),
format(dim(seurat_objs[[x]])[2],big.mark=",", trim=TRUE))) +
theme(legend.position="bottom",text = element_text(size = 9),
plot.title = element_text(size=12,hjust = 0.5), legend.key.size = unit(0, 'cm'),
legend.spacing = unit(0.1, 'cm'), legend.margin = unit(0.1, 'cm')) +
guides(color=guide_legend(nrow=c(8,3,5,3)[x], byrow=TRUE, override.aes = list(size=3)))
gg_annot_umap[[x]]
# Calculate percentage of cell type per cluster
tab <- table(Assigned=seurat_objs[[x]]@meta.data$cell_label, Clusters=seurat_objs[[x]]$seurat_clusters)
perc_tab <- sweep(tab, 2, colSums(tab ), FUN = '/')
gg_annot_table[[x]] <- pheatmap(perc_tab, color = colorRampPalette(c('white','blue'))(10),
cluster_rows = F, cluster_cols = F, fontsize = 9)
# gg_annot[[x]] <- plot_grid(gg_annot_umap[[x]], gg_annot_table[[x]][[4]], nrow = 2, rel_widths = c(1, 1))
# save_plot(paste0(getwd(),'/results/umap_annot_', data_filenames[x], '.png'),
# gg_annot[[x]], base_width = 2*6, base_height = 2*6)
} A significant step in processing single cell gene expression data is the linear dimension reduction that compresses datapoints into hyperplanes formed by a linear combination of the features within the dataset. This data used the same dimensional thresholds that the authors used in the original study because the thresholds seemed reasonable (determined with tissue-specific elbow plots that visualized variation captured by each dimension).
Figure 2.4.1: Standard deviation captured by
increasing dimensions of linear reduction of the lung gene expression
data with PCA. Authors set the dimensionality for the bladder dataset to
30 dimensions.
Figure 2.4.2: Standard deviation captured by
increasing dimensions of linear reduction of the heart gene expression
data with PCA. Authors set the dimensionality for the heart dataset to
35 dimensions.
Figure 2.4.3: Standard deviation captured by
increasing dimensions of linear reduction of the kidney gene expression
data with PCA. Authors set the dimensionality for the kidney dataset to
45 dimensions.
Figure 2.4.4: Standard deviation captured by
increasing dimensions of linear reduction of the bladder gene expression
data with PCA. Authors set the dimensionality for the lung dataset to 35
dimensions.
To investigate effect of leaving data unintegrated for pericyte marker analysis, the data was reprocessed in Seurat following the standard workflow (Fig 2.1). The same settings were used as the original study, including the dimensionality chosen for each dataset (meaning the same number of PCA principal components, Fig 2.4.1-4). UMAP clustering was used to visualize the data with average expression of pericyte markers CPSG4 and PDGFRB visualized, along with a heatmap of the cell type composition of each cluster according to annotations from the Tabula Muris atlas.
Figure 3.2.1A: UMAP clustering of gene expression from single cells isolated from lung issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 156 cells comprising cluster 13 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 284 cells from cluster 16 in the study).
Figure 3.2.1B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.
Figure 3.2.2A: UMAP clustering of gene expression from single cells isolated from heart issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 394 cells comprising clusters 10 and 11 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 392 cells from clusters 10 and 12 in the study).
Figure 3.2.2B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.
Figure 3.2.3A: UMAP clustering of gene expression from single cells isolated from kidney issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 229 cells comprising cluster 19 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 358 cells from cluster 17 in the study).
Figure 3.2.3B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.
Figure 3.2.4A: UMAP clustering of gene expression from single cells isolated from lung issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 62 cells comprising cluster 24 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 345 cells from clusters 10 and 11 in the study).
Figure 3.2.4B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.
Overall, the clustering approach without data integration yielded similar pericyte-like clusters as those found in the original study. However, for all tissues, there appeared to be a higher fraction of cells expressing both pericyte markers with this modified clustering approach, suggesting that avoiding integration resulted in more homogenous and purified pericyte-like clusters. Compared to the original study, about 10% of cells in lung tissue expressed both markers (compared to 35%), 25% of cells in heart tissue (compared to 50%), 30% of cells in kidney tissue (compared to 50%), and 10% of cells in bladder (compared to 25%).
When compared to the cell type annotation provided by the Tabula Muris database, the unsupervised clustering performed reasonably well in separating them into distinct groups (Figure 3.2.[1-4]B). For lung tissue, clusters appeared to be split between cell types that have a high degree of similarity, such as T cell types or myeloid dendritic cells (Fig 3.2.1B). Similar trends were observed with heart tissue, although there were a multitude of cell types with unknown cellular origins for this tissue type (Fig 3.2.2B). For kidney tissue, overlap was mostly seen for immune cells and endothelial cells (Fig 3.2.3B). For bladder tissue, there was overlap with epithelial cell types, but otherwise most clusters did not map to multiple cell types (Fig 3.2.4B).
The cluster of cells that had high expression of CSPG4 and PDGFRB and deemed to be a source of pericytes did not always match their annotated cell type from the atlas. For lungs, the pericyte cluster were labeled as pericytes, but were labeled as smooth muscle cells in the heart, stroma mesangial cells in the kidney, and endothelial cells in the bladder. These discrepancies will require further follow up with the literature and more detailed marker expression of these groupings, and also highlights the difficulty in determining clean annotations for cell type.
Future Questions
1. Does that marker expression of the stringent pericyte cluster contain
various pericyte markers shown in the literature?
2. Are there cells with marker expression profiles that match pericytes
found in other clusters?
3. Will the same results be observed if the data is reprocessed with
recent advances in transcriptomic analysis (4)?
Riew TR, Jin X, Kim S, Kim HL, Lee MY. Temporal dynamics of cells expressing CSPG4 and platelet-derived growth factor receptor-β in the fibrotic scar formation after 3-nitropropionic acid-induced acute brain injury. Cell Tissue Res. 2021 Sep;385(3):539-555. doi: 10.1007/s00441-021-03438-3. Epub 2021 Apr 17. PMID: 33864501.
Z. Jiang, T. Feng, Z. Lu, Y. Wei, J. Meng, C.-P. Lin, B. Zhou, C. Liu, H. Zhang, PDGFRb+ mesenchymal cells, but not CSPG4+ mural cells, contribute to cardiac fat. Cell Reports. 34, 108697 (2021).
H. van Splunder, P. Villacampa, A. Martínez-Romero, M. Graupera, Pericytes in the disease spotlight. Trends in Cell Biology. 0 (2023), doi:10.1016/j.tcb.2023.06.001.
Allan-Hermann Pool et al, Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references, Nature Methods (2023). DOI: 10.1038/s41592-023-02003-w
©2023 Bruce Corliss